TaxaMiner: Improving Taxonomy Label Quality Using Latent Semantic Indexing

Authors

  • Cartic Ramakrishnan
  • Christopher Thomas
  • Vipul Kashyap
  • Amit Sheth
Abstract

The development of taxonomies/ontologies is a human-intensive process requiring prohibitively large resource commitments in terms of time and cost. In our previous work we identified an experimentation framework for semi-automatic taxonomy/hierarchy generation from unstructured text. As observed in the preliminary results, the taxonomy/hierarchy quality was lower than we had anticipated. In this paper, we present two variations of that framework, viz. Latent Semantic Indexing (LSI) for document indexing and the use of term vectors to prune labels assigned to nodes in the final taxonomy/hierarchy. Using our previous results on taxonomy/hierarchy quality as the baseline, we present results that demonstrate significant improvement in taxonomy/hierarchy label quality resulting from these variations, and we offer insights into the reasons for the improvement. Finally, we discuss methods for further improving taxonomy/hierarchy quality.

Similar resources

TaxaMiner: an experimentation framework for automated taxonomy bootstrapping

Hierarchical taxonomies and thesauri are frequently used by content management systems for indexing, search and categorization. They are also being viewed as rudimentary ontologies for the emerging Semantic Web infrastructure. However, to date, development of taxonomies and thesauri are human intensive processes, requiring huge resources in terms of cost and time. It is critical that approaches...

Crowdsourced Semantic Matching of Multi-Label Annotations

Most multi-label domains lack an authoritative taxonomy. Therefore, different taxonomies are commonly used in the same domain, which results in complications. Although this situation occurs frequently, there has been little study of it using a principled statistical approach. Given that (1) different taxonomies used in the same domain are generally founded on the same latent semantic space, whe...

Evaluation of Background Knowledge for Latent Semantic Indexing Classification

This paper presents work that evaluates background knowledge for use in improving accuracy for text classification using Latent Semantic Indexing (LSI). LSI’s singular value decomposition process can be performed on a combination of training data and background knowledge. Intuitively, the closer the background knowledge is to the classification task, the more helpful it will be in terms of crea...

Using Random Indexing to improve Singular Value Decomposition for Latent Semantic Analysis

We present results from using Random Indexing for Latent Semantic Analysis to handle Singular Value Decomposition tractability issues. We compare Latent Semantic Analysis, Random Indexing and Latent Semantic Analysis on Random Indexing reduced matrices. In this study we use a corpus comprising 1003 documents from the MEDLINE-corpus. Our results show that Latent Semantic Analysis on Random Index...

Ensemble Approaches for Large-Scale Multi-Label Classification and Question Answering in Biomedicine

This paper documents the systems that we developed for our participation in the BioASQ 2014 large-scale bio-medical semantic indexing and question answering challenge. For the large-scale semantic indexing task, we employed a novel multi-label ensemble method consisting of support vector machines, labeled Latent Dirichlet Allocation models and meta-models predicting the number of relevant label...

Publication date: 2004